Support workflow terminate rollback#14465
Conversation
🦋 Changeset detectedLatest commit: ce29ed6 The changes in this PR will be included in the next version bump. This PR includes changesets to release 7 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
4614a06 to
f539470
Compare
|
✅ All changesets look good |
4ccbf75 to
01fd44d
Compare
@cloudflare/autoconfig
create-cloudflare
@cloudflare/deploy-helpers
@cloudflare/kv-asset-handler
miniflare
@cloudflare/pages-shared
@cloudflare/unenv-preset
@cloudflare/vite-plugin
@cloudflare/vitest-pool-workers
@cloudflare/workers-auth
@cloudflare/workers-editor-shared
@cloudflare/workers-utils
wrangler
commit: |
f9b1b1a to
9ed9edf
Compare
|
@vaishnav-mk Bonk workflow was cancelled. View workflow run · To retry, trigger Bonk again. |
50bf663 to
2c0a39a
Compare
4bafb21 to
ef038d9
Compare
|
Codeowners approval required for this PR:
Show detailed file reviewers
|
ef038d9 to
b1be7f1
Compare
|
/bonk review |
There was a problem hiding this comment.
Review: Support workflow terminate rollback
Overall this is a well-structured change. The replay-to-rebuild-registry approach correctly addresses the "empty registry after DO restart/pause" limitation that Devin originally flagged, and the persisted hasRollback metadata is a clean way to know which cached steps to replay. Tests cover the live-registry, replayed-registry, and paused-registry cases. A few things worth addressing before merge.
Blocking / needs discussion
1. The remote (production) API path also sends rollback: true — is this intended/supported?
packages/wrangler/src/workflows/utils.ts:131 adds rollback: true to the body of the remote PATCH /accounts/.../instances/:id/status call. The PR description scopes this to "local Workflows tooling," and the engine implementation lives entirely in @cloudflare/workflows-shared (local). If the production Workflows REST API doesn't yet accept a rollback field, this either silently no-ops or could be rejected — yet the CLI unconditionally prints 🥷 The instance "..." was terminated successfully (terminate.ts:65). If the remote API isn't ready, consider gating --rollback to --local only (and erroring otherwise), so users don't get a false success message for a rollback that never ran. If the remote API is ready, the PR description should say so.
Non-blocking
2. No user-facing signal when rollback runs vs. is skipped.
This was Devin's original concern. The replay path mitigates the empty-registry case, but if getEligibleRollbackSteps() returns nothing (no eligible steps, or malformed metadata), rollbacks are silently skipped while the CLI still reports plain success. Consider surfacing in the success log whether rollbacks executed (e.g. count of rollback steps run), so the behavior is observable.
3. replayRollbackRegistry re-runs the user workflow — confirm side-effect safety for uncached steps.
engine.ts:983 calls USER_WORKFLOW.run(...) in replay phase. Cached step.do calls correctly short-circuit to their stored value (context.ts:438-455) and sleep/sleepUntil/waitForEvent return early in replay phase, so a paused/sleeping instance halts replay safely. But if an instance is terminated while a new, uncached step.do would be the next thing the user code reaches before any sleep/wait, that closure would actually execute with real side effects during replay. In practice terminate targets running/paused instances, so this is unlikely — but a comment documenting this assumption (replay relies on all rollback-eligible steps already being cached and execution halting at the next suspension point) would help future maintainers.
4. Swallowed errors in replayRollbackRegistry and userTriggeredTerminate.
engine.ts:987 catches and discards all replay errors, and engine.ts:1015 logs rollback failures only to console.error. That matches the production "best-effort" rollback model, but a malformed user workflow throwing during replay will silently produce zero rollbacks. At minimum, consider a debug-level log so local users can diagnose "why didn't my rollback run?"
Minor / nits
engine.ts:316-319insetRollbackPhase/rollbackPhaseuse a string-literal union ("replay" | "rollback" | undefined) repeated in several signatures. A small exported type alias would reduce duplication.resources/workflows.ts:923still castshandle as unknown as WorkflowHandleto callterminate(options). Same TODO as thewrapped-binding.worker.tscast — worth a single shared note/tracking issue referencing the@cloudflare/workers-typesterminate-options gap so both casts get removed together.getEligibleRollbackStepsreads allstatesrows twice and builds two sets; fine for local volumes, just noting it scans the full log table per terminate.
Tests / changeset
- Changeset (
minorfor workflows-shared/wrangler/miniflare) and wording look correct. - Wrangler tests now assert the exact request body for both local and remote
--rollback, which is good. If point (1) results in gating--rollbackto local-only, the remote test should be updated accordingly.
|
I reviewed PR #14465 and posted a review. Summary of what I found: Strengths
Main concern (blocking/discussion)
Non-blocking Nits: repeated phase string-union (extract a type), the remaining |
|
/bonk review |
There was a problem hiding this comment.
Review: Support workflow terminate rollback (re-review @ dd02374)
Re-reviewing after the two commits added since the last /bonk review (6c313a6 → dd02374). Most of the previously-raised concerns are now resolved. Details below.
Previously-raised items — status
1. Remote (production) API path sends rollback: true — now substantiated. ✅ (with one ask)
The prior review flagged utils.ts:131 as blocking because it was unclear whether the production API accepts rollback on terminate. The linked docs PR (cloudflare-docs#31769) documents the remote CLI form as a supported operation:
npx wrangler workflows instances terminate <WORKFLOW_NAME> <INSTANCE_ID> --rollbackand describes production semantics ("Workflows runs the rollback handlers registered by completed or eligible steps before the instance reaches the terminated state. Steps without rollback handlers are skipped."). That's good evidence the production API is ready and this is intentional, so I no longer consider it blocking. One remaining ask: please confirm the production PATCH .../status endpoint accepting rollback is already deployed (not gated behind the docs merge), since terminate.ts:65 prints unconditional success — if the field is silently dropped server-side today, users get a false "terminated successfully" for a rollback that never ran.
2. Devin's "failed steps won't re-register rollback handlers on replay" — resolved. ✅
dd02374 fixes this directly: the maybeError branch in context.ts:461-475 now calls #registerRollback before re-throwing the cached error, and the replay branch at context.ts:484-497 registers rollback and returns early. New test replays cached failed steps before terminate rollback when registry is empty (engine.test.ts) asserts ROLLBACK_STEP_SUCCESS fires for the failed step. Good coverage.
3. Repeated phase string-union nit — resolved. ✅
RollbackPhase type alias is now used across setRollbackPhase / rollbackPhase (engine.ts:336).
4. Double full-table scan in getEligibleRollbackSteps — resolved. ✅
Now a single SELECT ... ORDER BY id DESC pass building the sets in one loop (engine.ts:256-321).
Still open — non-blocking
5. Swallowed errors in replayRollbackRegistry.
engine.ts:988 catches and discards all replay errors with only a comment; engine.ts:1017 logs rollback failures to console.error. For a local-dev tool, "why didn't my rollback run?" is a real question. A debug-level log in the replay catch (even just the error message) would make silent zero-rollback outcomes diagnosable.
6. No user-facing signal for rollback outcome.
terminate.ts:65 prints plain success regardless of whether any rollback steps executed, were skipped, or failed. Not blocking, but surfacing a count (e.g. "ran N rollback step(s)") would make the feature observable — especially relevant given (5) and the empty-registry edge cases.
Minor / nits
resources/workflows.ts:924still useshandle as unknown as WorkflowHandleto pass terminate options, mirroring the same@cloudflare/workers-typesgap. Since the docs PR now typesterminate(options?: WorkflowInstanceTerminateOptions), worth a single tracking note so this cast is removed once the types package ships.registerRollbackFn(lib/rollback.ts) now updatesstepContext/outputin place on re-registration instead of disposing — correct for replay, but confirm the previousdupRollbackStub/disposeRollbackStublifecycle isn't leaking stubs when a cacheKey re-registers with a freshfn(the update path keepsexisting.fnand drops the newregistration.fnwithout disposing it).
Tests / changeset
- Changeset (
minorfor workflows-shared/wrangler/miniflare, patches for dependents) looks correct. - Wrangler tests assert exact request bodies for both local and remote
--rollback— good.
Overall: the design (persisted hasRollback + replay-to-rebuild-registry) is sound and the newest commit closed the last correctness gap Devin flagged. Approving pending confirmation on point (1) — production API rollback support being live.
|
Review posted to PR #14465. Summary of what I found on re-review (head commit Resolved since last review
Still open (non-blocking)
Nits
Net: sound design, last correctness gap closed; approving pending confirmation on production API readiness. |
|
/bonk rereview |
There was a problem hiding this comment.
Review: Support workflow terminate rollback (re-review @ 7b9fa6a)
Re-reviewing after the one commit added since the last /bonk review (dd02374 → 7b9fa6a, "Clarify rollback replay diagnostics"). That commit directly addresses two of the open non-blocking items from last time, so the diff is small and additive. Design remains sound.
Newly resolved since last review
Swallowed replay errors — resolved. ✅
engine.ts:988-991 now captures the replay error and emits console.debug("Rollback replay stopped:", replayErr) instead of silently discarding it. This makes the "why didn't my rollback run?" case diagnosable in local dev, which was my previous item (5). Since workflows-shared is a Worker package (not wrangler), console.debug is the right primitive here — consistent with the existing console.error/console.warn usage in this file.
registerRollbackFn re-registration stub lifecycle — clarified. ✅
lib/rollback.ts:113-114 now documents that the existing entry already owns the duped stub and duplicate registrations refresh stepContext/output only (the helper hasn't duped the incoming fn, so there's nothing to dispose). This resolves the leak concern I raised last time — confirms it's intentional, not an oversight.
Still open — non-blocking
1. Confirm production PATCH .../status accepts rollback and is deployed.
Still my only real ask. terminate.ts:65 prints unconditional 🥷 ... terminated successfully regardless of whether the remote API honored rollback: true (utils.ts:131). I checked the linked docs PR (cloudflare-docs#31769) — it's still in Draft with no approving reviews yet, so it documents intended semantics but doesn't by itself prove the production endpoint is live today. If the field is silently dropped server-side, remote users get a false success for a rollback that never ran. Please confirm the endpoint is deployed (independent of the docs merge); if it isn't, gate --rollback to --local until it is.
2. Rollback outcome still not surfaced to the user.
Nice to see executeRollbacks now returns { ranAny, allSucceeded } (lib/rollback.ts:172,225) — but both call sites (engine.ts:1016 and engine.ts:1302) discard the result, so it never reaches the CLI, and terminate.ts still prints a flat success. The plumbing is now half there; propagating a count (e.g. "ran N rollback step(s)") would close the observability gap and pairs well with item (1). Non-blocking.
Minor / nits (unchanged)
resources/workflows.ts:924still useshandle as unknown as WorkflowHandleforterminate(options), mirroring the@cloudflare/workers-typesterminate-options gap (same TODO as therestartcast at:919). Worth a single tracking note so both casts are removed once the types package ships.
Tests / changeset
- No test/changeset changes in this commit; the prior coverage (live / replayed / paused registry, failed-step re-registration asserting
ROLLBACK_STEP_SUCCESS, and exact request-body assertions for local + remote--rollback) still stands. - Changeset (
minorfor workflows-shared/wrangler/miniflare, patches for dependents) looks correct.
Overall: the last two commits closed the correctness gap (failed-step replay) and the diagnosability gap (replay debug log). The design (persisted hasRollback + replay-to-rebuild-registry) is solid. Approving pending confirmation on item (1) — production API rollback support being deployed, given the unconditional success message and the still-draft docs PR.
|
Re-review posted to PR #14465. SummaryI re-reviewed at head commit Newly resolved:
Still open (non-blocking):
Minor: the remaining Net: last two commits closed the failed-step replay correctness gap and the diagnosability gap. Design is sound; approving pending confirmation on the production API being deployed. |
| private async replayRollbackRegistry( | ||
| metadata: InstanceMetadata | ||
| ): Promise<void> { | ||
| if (this.rollbackRegistry.size > 0) { | ||
| return; | ||
| } | ||
|
|
||
| const eligible = this.getEligibleRollbackSteps(); | ||
| if (eligible.length === 0) { | ||
| return; | ||
| } | ||
|
|
||
| this.rollbackEligibleCacheKeys = new Set(eligible); | ||
| const stubStep = this.createRollbackContext(); | ||
| this.setRollbackPhase("replay"); | ||
| try { | ||
| await this.env.USER_WORKFLOW.run( | ||
| metadata.event, | ||
| stubStep as unknown as WorkflowStep | ||
| ); | ||
| } catch (replayErr) { | ||
| // Match the production engine: replay may stop on normal workflow control | ||
| // flow; rollback execution uses whatever handlers replay registered. | ||
| console.debug("Rollback replay stopped:", replayErr); | ||
| } finally { | ||
| this.setRollbackPhase(undefined); | ||
| this.rollbackEligibleCacheKeys = undefined; | ||
| } | ||
| } |
There was a problem hiding this comment.
🚩 Replay mechanism re-runs the user's workflow function to reconstruct rollback handlers
The replayRollbackRegistry method at packages/workflows-shared/src/engine.ts:968-996 re-invokes this.env.USER_WORKFLOW.run(metadata.event, stubStep) with the rollback phase set to 'replay'. During replay, step.do returns cached values (or undefined for uncached steps) and sleep/sleepUntil are no-ops. This is a powerful but fragile approach: it assumes the workflow's control flow is deterministic and that re-running it with the same event will encounter the same steps in the same order. If the workflow has non-deterministic branching (e.g., based on Date.now() or random values), replay could register rollback handlers for the wrong steps. The rollbackEligibleCacheKeys filter at packages/workflows-shared/src/engine.ts:324-331 mitigates this by only keeping handlers whose cache keys match persisted eligible steps, but a reviewer should consider whether this is sufficient for all workflow patterns.
Was this helpful? React with 👍 or 👎 to provide feedback.
Adds terminate rollback support across Workflows local tooling and the remote Wrangler terminate command.
This wires
rollback: truethrough the terminate path:@cloudflare/workflows-sharedbinding and local engineworkflows instances terminate --rollbackworkflows instances terminate --local --rollbackThe production Workflows API already accepts terminate rollback; this PR adds the SDK/Wrangler client surfaces and fixes local rollback recovery so rollback can run after local engine restart/eviction.
Rollback is only valid for terminate and is only serialized when explicitly set to
true.A picture of a cute animal (not mandatory, but encouraged)